High-density oligonucleotide arrays can be used to rapidly examine large amounts of DNA sequence in a
high-throughput manner. An array designed to determine the specific nucleotide sequence of 705 bp of the rpoB
gene of Mycobacterium tuberculosis accurately detected rifampin resistance associated with mutations of 44 clinical 
isolates of M. tuberculosis. The nucleotide sequence diversity in 121 Mycobacterial isolates (comprised of 10 species) 
was examined by both conventional dideoxynucleotide sequencing of the rpoB and 16S genes and by analysis of 
the rpoB oligonucleotide array hybridization patterns. Species identification for each of the isolates was similar 
irrespective of whether 16S sequence, rpoB sequence, or the pattern of rpoB hybridization was used. However, for 
several species, the number of alleles in the 16S and rpoB gene sequences provided discordant estimates of the 
genetic diversity within a species. In addition to confirming the array's intended utility for sequencing the 
region of M. tuberculosis that confers rifampin resistance, this work demonstrates that this array can identify the 
species of nontuberculous Mycobacteria. This demonstrates the general point that DNA microarrays that 
sequence important genomic regions (such as drug resistance or pathogenicity islands) can simultaneously 
identify species and provide some insight into the organism's population structure. 

[The sequence data described in this paper have been submitted to GenBank under accession nos. AF09766- 
AF059853 and AF060279-AF060367.] 



For patients infected with Mycobacteria, especially those coinfected with the human
immunodeficiency virus type 1 and type 2 (HIV-1, HIV-2), the identity of the Mycobacterium
species and the presence of mutations that confer both biologically and clinically important
phenotypes are of critical importance. Both of these issues have implications for the
appropriate care and treatment of the infected patient. For example, although M. avium complex
(MAC) is the most common cause for both disseminated Mycobacterium disease and death in
patients with AIDS in the developed world (~25%-50% of adults and 10% of children with AIDS
are infected; Inderlied et al. 1993), Mycobacterium tuberculosis infections are also found in
these patient populations. Important public health and patient management decisions (e.g., the
need for clinical isolation and the choice of the appropriate therapeutic regimen) depend on a
timely and accurate identification of the infecting agent. Additionally, almost 10% of new
M. tuberculosis patients in the United States show resistance to at least one of the
first-line antituberculosis drugs [isoniazid (INH), pyrazinamide (PZA), rifampin (RIF),
ethambutol (EMB), and streptomycin (STR)], with ~2%-3% of cases resistant to both INH and RIF
(Moore et al. 1997).

On the basis of the insights provided by previously characterized RIF-resistant mutants in
Escherichia coli (Ovchinnikov et al. 1983; Jin and Gross 1988), mutations in the β-subunit of
the RNA polymerase (rpoB gene) of M. tuberculosis were first identified and characterized by
Telenti et al. (1993). Approximately 90%-95% of RIF-resistant M. tuberculosis strains have been
found to possess mutations in an 81-bp section (Musser 1995) of the 3534-bp coding region of
the rpoB gene (Miller et al. 1994). Interestingly, of the 122 M. tuberculosis isolates analyzed
by Telenti et al. (1993), no polymorphisms other than those conferring drug resistance were
observed in 411 nucleotides analyzed in each sample. In this study, we have explored the
sequence diversity in a larger segment of the rpoB gene for 10 species of the Mycobacterium
genus. By use of a high-density oligonucleotide array to derive both hybridization patterns and
nucleotide sequences, information from a conserved 705-bp region of the rpoB gene permitted the
simultaneous species identification of 121 isolates from 10 Mycobacterium species as well as
the detection of mutations that confer RIF resistance in 41 M. tuberculosis isolates.

Genotypic analyses of the Mycobacterium species isolates (Table 1) used in this study were
performed with a high-density oligonucleotide array (DNA chip) with probes complementary to
the M. tuberculosis rpoB gene sequence. The array served as a generic genotyping chip that
provided both specific nucleotide sequence and patterns of hybridization highly specific for
each species. This demonstrates the general capability of such an array to provide important
clinically relevant and biological information about the rpoB genes of related Mycobacterial
organisms that have not been sequenced previously.

RESULTS 

Rifampin Resistance-Conferring Mutations in the rpoB Gene
of M. tuberculosis

A total of 705 of the 3534 nucleotides of the M. tuberculosis rpoB gene was analyzed by use of
a high-density oligonucleotide array (Fig. 1). Although this segment of the rpoB gene has a GC
content of 67.7%, it was resequenced by use of the array with 100% concordance with
dideoxynucleotide methodology. A collection of 63 M. tuberculosis isolates gathered from the
New York and San Francisco areas was analyzed by the M. tuberculosis rpoB array for mutations
that confer rifampin resistance. The resistance to rifampin for 44/63 samples was determined
prior to the genotypic analyses by collaborating laboratories, and the results were kept
confidential from us until completion of genotypic analyses. Of the 44 M. tuberculosis isolates
that were phenotypically resistant to rifampin, 40 possessed mutations associated previously
with resistance. One additional isolate (TB40) displayed a mutation (Gln-513→Glu) not described
previously. Each array-derived nucleotide sequence was confirmed by use of conventional
dideoxynucleotide sequencing. Mutations at codons 531, 526, and 513 (E. coli codon numbering
system) were observed to occur most frequently in the rifampin-resistant isolates at 35%, 28%,
and 25%, respectively (Fig. 1). Mutations were not observed in any of the 705 nucleotides
analyzed of the rpoB gene for three of the phenotypically resistant isolates by use of either
array or dideoxynucleotide sequencing methodologies.
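
As a minimal sketch of how a resequenced segment might be compared with the wild-type sequence codon by codon, the following Python example reports amino acid substitutions such as Gln-513→Glu. The function name `codon_changes`, the toy six-base sequences, and the assumption that both sequences are in frame, of equal length, and start at a known codon number are illustrative only; this is not the GeneChip software used in the study, and deletions are not handled.

```python
from itertools import product

# Standard genetic code built in TCAG order (one-letter amino acid codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_changes(reference, sample, first_codon):
    """Return (codon_number, ref_aa, sample_aa) for each amino acid difference."""
    if len(reference) != len(sample) or len(reference) % 3:
        raise ValueError("sequences must be in frame and of equal length")
    changes = []
    for i in range(0, len(reference), 3):
        ref_aa = CODON_TABLE[reference[i:i + 3]]
        sample_aa = CODON_TABLE[sample[i:i + 3]]
        if ref_aa != sample_aa:
            changes.append((first_codon + i // 3, ref_aa, sample_aa))
    return changes


# Hypothetical two-codon fragment: CAA (Gln) -> GAA (Glu) at codon 513.
print(codon_changes("CAATTC", "GAATTC", first_codon=513))  # [(513, 'Q', 'E')]
```

Silent changes (for example CAA to CAG, both Gln) are deliberately not reported, because only substitutions that alter the protein are of interest when screening for resistance.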



Allelic Frequency and Species-Specific Polymorphisms 
Present in rpoB and the 16S Genes of Mycobacterium 

The rpoB and 16S genes from nine species of Mycobacterium were analyzed at the nucleotide level
by use of dideoxynucleotide-based methodology and compared with the M. tuberculosis sequence of
these genes. A total of 83 and 82 Mycobacterium isolates were characterized for the rpoB and
16S genes, respectively (Table 2). In comparison with M. tuberculosis, an average of 80
polymorphic positions was observed within the 705 nucleotides of rpoB for each species
(Table 2A), and on average, 21 polymorphic positions were seen within the 180 nucleotides of
the 16S gene (Table 2B). Several of the polymorphic positions were observed to be species
specific, with a subset of these present in every isolate of that species (conserved
polymorphism). For each of 63 M. tuberculosis isolates, no other polymorphic sequences were
observed in the rpoB gene, except for mutations conferring resistance to rifampin.

Figure 1 (A) Analysis of a 705-bp region (codons 482-715, E. coli numbering) of the Mycobacterium rpoB gene
(1164 amino acids). (B) Region encompassing the 81-bp domain, described previously as having all of the mutations
that correlate with the decreased sensitivity to rifampin (Musser 1995). Arrows indicate locations of PCR
amplification primers and all of the primers used for dideoxynucleotide sequencing (see Methods for sequence).
(C) Codon positions within the 81-bp region, containing mutations in 41 of the 44 RIF-resistant M. tuberculosis
mutants. The frequency and type of mutation observed at each codon is presented.

Interspecies variation for the 705-nucleotide region of the rpoB gene ranged from 14.3%
(Mycobacterium chelonae and M. xenopi) to 4.1% (M. avium and M. scrofulaceum) (Fig. 2).
Intraspecies variation for rpoB was highest for M. smegmatis (4.2%) and M. gordonae (3.7%),
with M. tuberculosis and M. xenopi exhibiting no nucleotide variation in >70 isolates analyzed.
For both genes, M. kansasii and M. intracellulare isolates displayed only one-fifth to
one-third as many alleles as isolates examined, whereas M. smegmatis and M. fortuitum displayed
as many alleles as isolates examined (Table 3). Interestingly, a contrasting view of the
diversity within a species group was observed depending on whether the rpoB or 16S genes were
examined. For example, for M. xenopi, one and three alleles for rpoB and 16S genes,
respectively, were observed. In contrast, M. intracellulare exhibited one and four alleles for
the 16S and rpoB genes, respectively. Finally, M. tuberculosis and M. fortuitum were similar in
complexity of their species groups as exemplified in the number of alleles observed for the 16S
and rpoB genes.
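
The two quantities used above, percent nucleotide variation between aligned sequences and the number of alleles among a set of isolates, can be illustrated with a short Python sketch. It assumes that sequences are already aligned to equal length and that an allele is simply a distinct sequence; the function names and the 10-base toy sequences are hypothetical and stand in for the 705-bp rpoB and 180-bp 16S regions.

```python
def percent_variation(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * differences / len(seq_a)


def count_alleles(sequences):
    """Number of distinct sequences (alleles) among a set of isolates."""
    return len(set(sequences))


# Hypothetical 10-base sequences for three isolates of one species.
isolates = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
print(percent_variation(isolates[0], isolates[2]))  # 10.0
print(count_alleles(isolates))                      # 2
```

On a 705-nucleotide region, 14.3% variation corresponds to roughly 101 differing positions and 4.1% to roughly 29.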

Species Identification Based on DNA Sequences of 16S 
and rpoB Genes 

DNA sequence analysis of 705 bases of the rpoB and 180 bases of the 16S genes for 81 of the 121
Mycobacterium isolates (Table 1) was determined by use of conventional dideoxynucleotide
methodology. Analyses of these sequences permitted the clustering of each of the isolates into
groups on the basis of either the 16S or rpoB sequences (Fig. 3). The confidence values for
each species cluster indicated the groupings were very stable. The bootstrap values ranged from
100% to 69.3% for 16S gene sequences and from 100% to 71% for the rpoB gene sequences. The
lowest confidence values were observed in clustering M. fortuitum and M. scrofulaceum isolates
with 16S and rpoB sequences, respectively. Of the 81 isolates, 75 were grouped into the same
species clusters by use of either gene sequence. Five of the 81 isolates (m4, m36, m48, m66,
m125) were assigned to different species clusters, depending on whether the 16S or rpoB
sequence was used as the basis for analysis. Specifically, species identification on the basis
of the rpoB sequence for three of the five isolates (m125, m48, m4) was in agreement with that
assigned by standard microbiological methods used by the providing laboratories, but differed
in assignment on the basis of 16S sequence analysis. Two of the five isolates (m36, m66) were
placed in one of three different species clusters depending on the gene sequence analyzed or
the microbiological assay used.
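
The idea behind the bootstrap confidence values can be sketched in Python as follows. This is a simplified illustration, not the procedure used for Fig. 3: instead of rebuilding a neighbor-joining tree for every replicate, it resamples alignment columns with replacement and asks how often a chosen group of isolates stays compact, meaning that every member is closer to all other members than to any isolate outside the group. All names, the compactness criterion, and the toy alignment are assumptions made for illustration.

```python
import random


def p_distance(a, b):
    """Fraction of aligned positions that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def group_is_compact(alignment, group):
    """True if each member is closer to every other member than to any outsider."""
    outsiders = [name for name in alignment if name not in group]
    for a in group:
        farthest_member = max(
            (p_distance(alignment[a], alignment[b]) for b in group if b != a),
            default=0.0,
        )
        nearest_outsider = min(p_distance(alignment[a], alignment[o]) for o in outsiders)
        if farthest_member >= nearest_outsider:
            return False
    return True


def bootstrap_support(alignment, group, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which the group stays compact."""
    rng = random.Random(seed)
    hits = sum(
        group_is_compact(resample_columns(alignment, rng), group)
        for _ in range(replicates)
    )
    return hits / replicates


# Hypothetical alignment with two clearly separated groups of isolates.
toy = {"a1": "AAAAAAAAAA", "a2": "AAAAAAAAAA", "b1": "CCCCCCCCCC", "b2": "CCCCCCCCCC"}
print(bootstrap_support(toy, {"a1", "a2"}))  # 1.0
```

Groups separated by only a few informative columns would lose those columns in some replicates and therefore receive support below 100%, which is the behavior the reported 69.3% and 71% values reflect.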

Analyses of both 16S and rpoB sequences together were useful in providing more precise species
identification or clarification for 10 of the 81 isolates. Of these 10 isolates, 7 (m91, m104,
m68, m64, m65, m67, m71) were identified as MAC or avium-intracellulare by the laboratories of
origin. On the basis of both 16S and rpoB sequence analyses, six of the seven isolates were
unambiguously identified as M. avium, with m68 identified as M. xenopi. The remaining three
isolates m77, m27, and m28 were identified by microbiological assays as M. smegmatis, M. avium,
and M. avium, respectively. Analysis with both gene sequences clustered these isolates as
M. fortuitum (m77), M. intracellulare (m27), and M. intracellulare (m28), respectively.

Figure 2 Interspecies and intraspecies variation observed in 10 species of Mycobacterium.
Values represented are percent variation based on 705 bp of the rpoB gene for M. tuberculosis
(Mt), M. avium (Ma), M. intracellulare (Mi), M. gordonae (Mg), M. chelonae (Mc), M. xenopi
(Mx), M. scrofulaceum (Msc), M. smegmatis (Ms), M. kansasii (Mk), M. fortuitum (Mf). Data from
M. scrofulaceum are represented by two isolates.

[Figure 3 panels: 16S Clusters; rpoB Clusters]

Figure 3 Representation of clusterings of Mycobacterium isolates among 10 species of Mycobacterium. The tree is
unrooted and is based on the nearest neighbor joining method for the 180- and 705-bp DNA sequences of the 16S
and rpoB genes, respectively, from each of the isolates. Stability of the clusters was evaluated by use of 1000 cycles
of bootstrapping (values on tree branches). The six isolates that are assigned to different clusters, depending on the
gene sequence used, are highlighted in gray (see text for discussion).



Species Identification Using Hybridization Pattern 
Recognition Analysis of High-Density 
Oligonucleotide Arrays 

The high-density oligonucleotide array used to detect the mutations conferring rifampin
resistance in M. tuberculosis was also used to simultaneously genotype and speciate
nontuberculous isolates. As the interspecies sequence variation for the rpoB gene of some of
the nontuberculous isolates was >10% (Fig. 2), significant portions of the hybridization
patterns produced from nontuberculous rpoB amplicons were unique (Fig. 4A). Each hybridization
pattern can be represented as a plot of the fluorescence intensities as a function of the base
position in the sequence of rpoB (Fig. 4B). When nontuberculous DNA amplicons were hybridized,
the fluorescence intensities of many of the interrogating probes were reduced within the
regions of the rpoB sequences (Fig. 4A). This reduction in hybridization intensity affected the
ability to determine specific nucleotides, because allele-specific probes for some of the
interrogated bases do not exist on the array (Fig. 4C; Table 4). However, even though the
arrays were designed for the M. tuberculosis gene sequence, the sequence of most of the
polymorphic positions for each species could be determined (see legend to Fig. 4C). Repeated
measurements indicated that highly specific and reproducible hybridization patterns
characteristic of each of the Mycobacterial species could be obtained.

Figure 4 (A) Hybridization patterns produced on an oligonucleotide array that has probes selected to be
complementary to the 705 bases of M. tuberculosis rpoB gene sequence. The amplified, fluorescently labeled 705-bp
antisense products from the rpoB genes of M. tuberculosis and M. gordonae are presented. A total of 5648
oligonucleotide probes were used to interrogate each of the 705 bp in the amplified product. (B) The intensity of
hybridization for each of the 705 probes that are complementary to the wild-type sequence (Miller et al. 1993) of
the M. tuberculosis rpoB gene is plotted as a function of the base interrogated in the gene sequence. The blue and
red plots are the intensity profiles of M. tuberculosis and M. gordonae images shown in A. The intensities are obtained
from GeneChip software (see Chee et al. 1996; Kozal et al. 1996) and are plotted and compared using the Ulysses
software program (Chee et al. 1996). (C) The identity of each base in the 705 bp of the rpoB amplicons is determined
by the hybridization results of eight probes (four for each strand). The sequences derived from the images in A for
M. tuberculosis (M-tb) and M. gordonae (M-go) are shown. Differences between the two genes are denoted by
highlighted bases in the M. gordonae sequence. Of these differences, 61% (95/155) of the positions can be
identified as specific polymorphic differences between the two species. The remainder of the differences are
unidentified or marked by IUPAC ambiguity code. Of the positions identified as a polymorphic difference 7/33 bases
correspond to species-specific polymorphisms present in all isolates of M. gordonae (Table 2).
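
The way a base is read from a set of four probes that differ only at the interrogated position (as described for Fig. 4C) can be illustrated with a small Python sketch. The decision rule used here, taking the brightest probe and returning an ambiguous call when it is less than 1.2 times the runner-up, is an assumption chosen for illustration and is not the algorithm of the GeneChip software. Only one strand is modeled, and the intensities are hypothetical.

```python
def call_base(intensities, min_ratio=1.2):
    """Call one position from four probe intensities keyed by base.

    Returns the base with the brightest probe, or 'N' when there is no
    signal or the brightest probe is less than `min_ratio` times the
    second brightest.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"
    if second > 0 and best / second < min_ratio:
        return "N"
    return best_base


def call_sequence(probe_sets, min_ratio=1.2):
    """Call every interrogated position in order."""
    return "".join(call_base(probes, min_ratio) for probes in probe_sets)


# Hypothetical intensities for three interrogated positions.
probe_sets = [
    {"A": 900, "C": 120, "G": 100, "T": 90},   # clear A
    {"A": 200, "C": 210, "G": 190, "T": 205},  # weak, ambiguous signal
    {"A": 80, "C": 70, "G": 1500, "T": 60},    # clear G
]
print(call_sequence(probe_sets))  # ANG
```

The ambiguous middle position mirrors what happens when a nontuberculous amplicon hybridizes poorly to probes designed against the M. tuberculosis sequence: overall intensity drops and no single probe stands out.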

The hybridization patterns produced on the M. 
tuberculosis rpoB arrays for each of 121 Mycobacte- 
rium isolates were analyzed by use of two algorith- 
mic approaches to carry out species identification. 



The first algorithm used a straightforward linear re- 
gression analysis of the 1410 (705 bases on each 
strand) intensities for the probes discovered to be 
complementary to the wild-type sequence of the M. 
tuberculosis rpoB gene. This approach selectively ana- 
lyzed each of the Mycobacterium isolate sequences 
with only 25% of the probes present in the array. 
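
A minimal Python sketch of the pairwise comparison behind this first approach is given below. It assumes that each isolate is summarized by a vector of perfect-match probe intensities and that the dissimilarity between two isolates is 1 - r², where r² is the coefficient of determination of a simple linear regression of one profile on the other (equal to the squared Pearson correlation). The function names and the four-probe toy profiles are hypothetical; the study used 1410 intensities per isolate.

```python
# Probe bookkeeping from the text: 705 bases x 2 strands = 1410 perfect-match
# probes, out of 705 x 2 x 4 = 5640 probes, i.e. 25% of the array.
PERFECT_MATCH = 705 * 2
ALL_PROBES = 705 * 2 * 4


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length intensity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)  # undefined for a constant profile


def dissimilarity_matrix(profiles):
    """Map each ordered pair of isolate names to 1 - r^2."""
    names = sorted(profiles)
    return {
        (a, b): 1.0 - r_squared(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Hypothetical four-probe intensity profiles for three isolates.
profiles = {
    "iso1": [10, 20, 30, 40],
    "iso2": [12, 22, 32, 42],  # same pattern as iso1, shifted
    "iso3": [40, 10, 35, 15],  # different pattern
}
d = dissimilarity_matrix(profiles)
print(round(d[("iso1", "iso2")], 4))  # 0.0
print(round(d[("iso1", "iso3")], 4))  # 0.8077
```

A matrix of this kind is the input to a hierarchical clustering step; isolates with similar hybridization patterns have values near 0 and fall into the same group.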

Species              Correct calls     Total no.        No. of polymorphic
                     (bases)a (%)      polymorphic      positions
                                       positionsb       identified (%)

M. avium             467 (72)          56               38 (68)
M. chelonae          533 (82)          85               29 (34)
M. fortuitum         404 (62)          86               62 (72)
M. gordonae          449 (69)          71               44 (62)
M. intracellulare    479 (73)          58               50 (86)
M. kansasii          446 (68)          67               51 (76)
M. scrofulaceum      443 (68)          62               47 (75)
M. smegmatis         353 (54)          94               51 (54)
M. xenopi            432 (66)          72               51 (71)

a Correct calls of nucleotides based on 653 total bases analyzed.
b Total number of polymorphic positions detected using dideoxynucleotide sequencing.



The results of this clustering were represented in a color contour plot (Fig. 5A). The color of
each pairwise comparison of the 1 - r² values ranged from highly correlated (1; blue) to
uncorrelated (0; red) pairs. Interestingly, isolates m125, m66, and m4, which, on the basis of
their DNA sequences, were observed to group outside of their biochemical/microbiologically
defined species designations, are also observed to cluster outside their biochemically defined
groups by this method. Most notably, all of the other isolates were clustered into the same
species groups as predicted by the 16S and rpoB sequences.

The second algorithmic approach used to analyze individual hybridization patterns was based on
principal component analysis of all of the 5640 probes (705 bases x 2 strands x 4
probes/interrogated base) on the rpoB array. Unlike the earlier linear regression analysis, in
this approach, no prioritization for the M. tuberculosis-derived, perfect match probes was
used. The most informative probes were identified by variance-based variable reduction followed
by principal component analysis of the covariance matrix, which reduced the total probe set to
15 orthogonal components. These components accounted for 93% of the observed variability in the
probe intensities. By use of the same hierarchical clustering procedure used to group the
isolates on the basis of linear regression correlation coefficients, a single linkage
clustering result is displayed as a hierarchical tree structure (Fig. 5B). Confidence values
again showed the clusters to be highly stable. The values ranged from 100% (M. smegmatis,
M. xenopi, M. kansasii) to 77.6% (M. intracellulare). The membership in each of the major
species clusters was identical to those clusters derived from the rpoB sequences with the
exception of some of the six isolates highlighted in the DNA clusterings (Fig. 3). Of the five
isolates that clustered in different species groups, depending on the gene sequence used to
carry out the cluster analysis (m4, m36, m48, m66, m125), only samples m36 and m125 were not
grouped as they were by their rpoB DNA sequence.

DISCUSSION

Hybridization of DNA to high-density oligonucleotide arrays offers the possibility of examining
large amounts of sequence with a single hybridization step. The utility of this approach was
recently demonstrated by the complete analysis of the entire human mitochondrial genome (Chee
et al. 1996). Other applications of these arrays have included the sequence analysis of viral
(Kozal et al. 1996) and human genomic sequences (Hacia et al. 1996), quantitative measurements
of multiple murine (Lockhart et al. 1996) and yeast (Wodicka et al. 1997) gene expression and
functional mapping of the yeast genome (Shoemaker et al. 1996). In each of these applications,
oligonucleotide probes were selected and synthesized on the arrays as specific complements to
each interrogated nucleotide in the targeted sequence. In this report, we have pursued the
strategy of synthesizing an array that is composed of oligonucleotide probes specifically
complementary to the M. tuberculosis rpoB gene and of using this array as a generic tool to
analyze the same gene from nine other nontuberculosis Mycobacterial species. The use of this
array allowed for the simultaneous detection of mutations that confer rifampin resistance as
well as species identification.

In the most straightforward use of this array, 41 M. tuberculosis isolates were observed to
possess a total of 12 mutant rpoB alleles involving 8 codons of the rpoB gene. Forty of the
mutations were of the missense type and one mutation was a 6-base deletion. Three other
M. tuberculosis isolates (TB10, TB15, TB76) were found to be phenotypically resistant but
lacked any mutations within the sequenced 705 bp of the rpoB gene. Such isolates are not
unexpected because ~10% of isolates exhibit rifampin resistance through an unknown mechanism
(Heym et al. 1994; Kapur et al. 1994; Williams et al. 1994).

Figure 5 (A) Results of hybridization patterns of 121 Mycobacterium isolates analyzed by linear regression assays
(SAS Institute 1990). Only probes complementary to the wild-type sequence of the rpoB gene of M. tuberculosis were
used in the analysis (i.e., 25% of probes). The pairwise comparison of the 1 - r² values for each of the 121 isolates
was performed and like values clustered. The contour plot represents values for 1 (purple) to 0 (red). (Mx) M. xenopi,
(Mt) M. tuberculosis, (Mg) M. gordonae, (Mf) M. fortuitum, (Msc) M. scrofulaceum, (Mi) M. intracellulare, (Mc) M.
chelonae, (Ms) M. smegmatis, and (Mk) M. kansasii. (B) Results of hybridization patterns analyzed by use of principal
component assays. Using all of the 5648 probes on the rpoB array, each of the isolates was clustered on the basis
of 15 orthogonal components. The clustering of the isolates was represented by an unrooted nearest neighbor
joining tree. The six isolates noted in Fig. 3 are highlighted in this tree.
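
The principal component step summarized in the legend to Fig. 5B (variance-based reduction of the probe set, followed by principal component analysis of the covariance matrix) can be sketched in Python with NumPy as follows. The function name `pca_reduce`, its parameters, and the randomly generated intensity matrix are assumptions made for illustration; the study reduced the full probe set to 15 components accounting for 93% of the variability, whereas the toy example below uses much smaller numbers.

```python
import numpy as np


def pca_reduce(intensities, keep_probes, n_components):
    """Variance-filter probes, then project isolates onto principal components.

    intensities: array-like of shape (n_isolates, n_probes).
    Returns (scores, explained), where scores has shape
    (n_isolates, n_components) and explained is the fraction of the
    retained-probe variance captured by those components.
    """
    x = np.asarray(intensities, dtype=float)
    # Keep the probes whose intensities vary most across isolates.
    variances = x.var(axis=0, ddof=1)
    top = np.argsort(variances)[::-1][:keep_probes]
    centered = x[:, top] - x[:, top].mean(axis=0)
    # Eigen-decomposition of the covariance matrix of the retained probes.
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = centered @ eigvecs[:, order]
    explained = eigvals[order].sum() / eigvals.sum()
    return scores, explained


# Hypothetical data: 12 isolates x 40 probe intensities.
rng = np.random.default_rng(0)
data = rng.normal(size=(12, 40))
scores, explained = pca_reduce(data, keep_probes=20, n_components=3)
print(scores.shape)  # (12, 3)
```

The component scores, rather than the raw probe intensities, would then be passed to the clustering step, which keeps the comparison between isolates from being dominated by thousands of uninformative probes.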

Because rifampin resistance-conferring mutations in rpoB have also been observed in
nontuberculous Mycobacteria species (Honore and Cole 1993; Levin and Hatfull 1993; Musser 1995)
and because the use of 16S genotyping has proven problematic in speciating some mycobacteria
(Fox et al. 1992), the rpoB oligonucleotide array and conventional dideoxynucleotide sequencing
were used to analyze sequence diversity within and between members of the mycobacterial genus.
Among the 10 Mycobacterium species studied, analysis of the 705 nucleotides of the rpoB gene
revealed that intraspecies variation was smallest within M. tuberculosis, M. xenopi, and
M. kansasii species and greatest within M. gordonae, M. fortuitum, and M. smegmatis species.
For some of these species, like M. gordonae, this result would not be unexpected as they are
known to be a heterogeneous group. However, interspecies variation of the rpoB gene ranged
considerably, with M. avium and M. scrofulaceum displaying the least and M. chelonae and
M. xenopi displaying the greatest variation (Fig. 2). Importantly, the sequence analysis
identified several species-specific single nucleotide polymorphisms among the ten Mycobacterium
species. Hunt et al. (1994) described the presence of three M. tuberculosis-specific signature
nucleotides while examining a 180-bp region of rpoB for resistance-conferring mutations. In
this study, we have identified many nucleotide positions for each of the nine nontuberculosis
species, which when considered in a combinatorial fashion, provide a unique set of fingerprints
for a particular Mycobacterium species.

When the DNA sequences of the rpoB and 16S genes were compared to measure the extent of allelic
variation in each of 93 and 83 Mycobacterium isolates (Table 3), respectively, very different
perspectives of the genomic diversity within each species were obtained. For M. tuberculosis,
M. scrofulaceum, M. chelonae, and M. kansasii, allelic variation in both genes was similar and
relatively low. Thus, the analysis of both genes provided similar indications of the diversity
in each of the respective genomes. The observation of 3 rpoB alleles in a collection of 11
isolates of M. kansasii is consistent with the identification of five subspecies for this
species (Picardeau et al. 1997). A similar representative picture was observed for the 16S and
rpoB genes of M. avium, M. smegmatis, and M. fortuitum, although the overall allelic diversity
was greater. However, analysis of the genetic profiles for M. xenopi, M. intracellulare, and
M. gordonae yielded a different picture of genomic variability within each species. The allelic
variation observed in M. xenopi by use of rpoB was low, whereas the variation in the 16S gene
sequences for the 13 isolates studied in this species was higher. Conversely, the 16S sequence
variation observed in isolates of M. intracellulare and M. gordonae indicated relatively lower
levels of sequence diversity within these species, whereas rpoB sequences suggested higher
levels of genetic diversity. When the rpoB or 16S gene sequences were used to cluster each of
81 Mycobacterium isolates into species groups, by use of an unrooted neighbor joining distance
matrix, 76 of the isolates were clustered into the same groups irrespective of which gene
sequence was used. Of the five isolates (m4, m36, m48, m66, m125) that were clustered into
different groups depending on which gene sequence was used, the sequences of the rpoB gene
grouped three of the isolates (m4, m48, m125) in a manner similar to the grouping derived from
the biochemical classifications performed by the contributing microbiological laboratories. Of
the remaining isolates, two isolates (m36, m66) were placed in three different groupings.

These results emphasize several points. First, the identification of species on the basis of
the sequence diversity present in 705 nucleotides of the rpoB gene was similar to species
assignments derived from an analysis of 180 bp of 16S sequences. In a similar manner, Kapur et
al. (1994) also provided evidence that polymorphisms present in the 65-kD heat shock protein
gene could be used to identify mycobacterial species. Second, because several species of
Mycobacterium exhibited an apparently large number of 16S and/or rpoB alleles, the use of
single representative isolates for each species will not provide a sufficiently large database
to accurately classify such isolates. A larger database of nucleotide sequences, derived from
several regions of the genome, will be needed to determine species and subspecies
identification accurately. Third, for those Mycobacterium species for which different genes
provided different indications of the extent of genetic diversity, additional regions of the
genome will need to be characterized. Those regions to be sequenced should reflect both rapidly
and more slowly evolving regions. This approach would aid in both speciation/subspeciation
groupings and individual isolate tracking.

Fox et al. (1992) made this point in their analysis of two Bacillus species, noting that use of
16S gene sequences for species identification is most useful in distinguishing relationships
between genera and will resolve some species but not more recently diverged species. Additional
sequence surveillance could be made coincident with analysis of genes in which mutations confer
drug resistance. For example, genotypic analysis of the catalase-peroxidase gene (katG) (Heym
et al. 1995; Musser et al. 1996) and the promoter region of the INH genes (Musser et al. 1996)
for isoniazid resistance, ribosomal protein S12 gene (rpsL) and the 16S rRNA genes (rrs) for
streptomycin resistance (Fin et al. 1993; Sreevatsan et al. 1996), the DNA gyrase A subunit
gene (gyrA) for fluoroquinolone resistance (Musser 1995), pyrazinamidase/nicotinamidase gene
(pncA) for pyrazinamide resistance (Scorpio et al. 1992; Scorpio and Zhang 1996; Sreevatsan et
al. 1997) and the emb operon for ethambutol resistance (Telenti et al. 1997) could all be used
to simultaneously monitor drug resistance and provide data for species identification. An array
interrogating many of these genes has been designed and synthesized recently and is currently
being tested (Fig. 6).

Finally, analysis of the hybridization patterns on the rpoB high-density oligonucleotide array
provides another level of important information about the identity of the Mycobacterium
isolates. It is striking that the grouping of the 121 Mycobacterium isolates based on
hybridization patterns was virtually identical to the clustering obtained by the
dideoxynucleotide-based DNA sequence of the rpoB gene. This result underscores two important
conclusions: There are characteristic, conserved hybridization patterns for each species group,
and high-density arrays produce consistent results by use of different manufactured lots of
arrays and multiple sample preparations.

These results follow in the footpath of other 
strategies used to identify Mycobacterium species on 



444 GENOME RESEARCH



RESISTANCE DETECTION AND SPECIES IDENTIFICATION 





[Figure 6 image; region labels on the array: rpoB, katG, rpsL, 16S, gyrA, inh orf 1, dnaJ,
hsp65 kd, 32 kd, and special tilings for rpoB and katG]

Figure 6 A high-density oligonucleotide array used to genotype 731
bp of rpoB, 2286 bp of katG, 356 bp of rpsL, 1683 bp of 16S, 731 bp
of gyrA, 281 bp of inh orf, 341 bp of hsp 65 kd, 1097 bp of dnaJ, and
1279 bp of 32 kd genes. Additionally, specific insertions, deletions, and
missense mutations in rpoB and katG are interrogated by the alternative
allele-specific oligonucleotide probes at the bottom of the chip.



the basis of lipid profiles (Minnikin and Goodfellow 1980; Butler et al. 1991; Lambert et al.
1996) or the fingerprint of isolates with Mycobacterium-specific genetic elements (for review,
see Small and van Embden 1994). The most definitive fingerprint approach would involve the
interrogation of most, if not all, of the nucleotides of the Mycobacterium genome. Currently,
the genomes of two isolates (H37Rv and CSU#93) of M. tuberculosis are being sequenced. By use
of the completed sequence as a template to develop a high-density array, nearly all of the
nucleotides of the genome of other M. tuberculosis isolates, as well as isolates from other
Mycobacterium species, could be analyzed. Pattern analysis algorithms like the ones presented
in this study could be used to analyze the results of the genome-wide surveys to identify
identical and divergent sequence regions. For M. tuberculosis isolates, regions with such
sequence variations may be responsible for clinically or epidemiologically important
phenotypes. For nontuberculosis isolates, the hybridization patterns on a genome-wide level
would permit grouping of isolates in a manner similar to that reported in this study.
Furthermore, such a strategy holds the possibility that similar genomic arrays can be
constructed for each of the major species of the eubacteria. These arrays could serve the dual
role of surveillance of biologically/clinically important genomic regions (i.e., drug
resistance, toxins, pathogenicity factors) as well as allow for direct analysis of the
bacterial genome for identification and epidemiological purposes.

METHODS 

Bacterial Isolates, Preparation 
of Genomic DNA, 
and Drug Sensitivity Measurements 

Clinical and ATCC isolates of M. avium, M. chelonae, M. fortuitum, 
M. gordonae, M. kansasii, M. intracellulare, M. scrofulaceum, 
M. smegmatis, M. tuberculosis, and M. xenopi were grown on 
Löwenstein-Jensen slants or in BACTEC 13A broth (Becton-Dickinson, 
Sparks, MD) (Table 1) (Roberts et al. 1991). Species identification 
of each isolate listed in Table 1 was originally performed at the 
center that contributed the sample. Methods used for species 
identification were the biochemical, probe-based, and mycobacterial 
growth assays in current use at each contributing center (Butler et 
al. 1991; Nolte et al. 1995). DNA from most isolates was prepared 
by boiling one bacterial colony in 100 µl of water for 10 min 
(Kirschner et al. 1993; Vaneechoutte et al. 1993). Cellular debris 
was removed by brief centrifugation. For isolates obtained from 
Stanford University Medical Center, DNA was extracted as described 
elsewhere (van Embden et al. 1993). Drug susceptibility testing was 
performed by the proportion method with medium containing 1 µg/ml 
rifampin. An isolate was considered resistant if growth on the 
rifampin-containing medium was >1% of the growth on the drug-free 
medium (Inderlied et al. 1995). 
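The resistance call described above reduces to a single ratio test. The sketch below is only an illustration of that rule: it assumes growth is quantified as colony counts on the two media, which the text does not state, and the function and parameter names are hypothetical.

```python
def is_rifampin_resistant(colonies_on_drug: int,
                          colonies_on_control: int,
                          threshold: float = 0.01) -> bool:
    """Proportion-method call: resistant when growth on the
    rifampin-containing medium exceeds `threshold` (1% in the text)
    of growth on the drug-free medium.

    Colony counts as the growth measure are an assumption."""
    if colonies_on_control <= 0:
        raise ValueError("the drug-free control must show growth")
    return colonies_on_drug / colonies_on_control > threshold


# Hypothetical plate counts, for illustration only.
print(is_rifampin_resistant(3, 200))  # 1.5% of control growth
print(is_rifampin_resistant(1, 200))  # 0.5% of control growth
```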



PCR Amplification and Molecular Cloning 

A 705-bp segment of the rpoB gene from each Mycobacterium 
isolate was amplified by use of the following primer pair: 
rpoB-F (5'-CCCAGGACGTGGAGGCGATCACACCGCA-3') 
and rpoB-R (5'-CCTCCCCGCGTCGATCGCCCCGC-3'). 
Amplification of a 1433-bp region of the 16S genes was 
accomplished with primers 16SF (5'-CTGCTTAACACATGCAAGTCGA-3') 
and 16SR (5'-CAATCGCCGATCCCACCTT-3'). From the 100-µl boiled 
sample containing the Mycobacterium genomic DNA, 1-2 µl was 
amplified in a 100-µl PCR reaction. Each PCR reaction contained 
200 µM of deoxynucleotide triphosphates, 200 nM of each primer, 
2.5 units of Taq polymerase (Boehringer Mannheim, Indianapolis, IN), 
10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, and 5% 
DMSO. PCR amplification conditions consisted of an initial 
denaturation step of 5 min at 95°C, followed by 35 cycles 
consisting of a denaturation step at 95°C for 1 min, primer 
hybridization at 60°C (68°C for M. tuberculosis) for 30 sec, and 
polymerase extension at 72°C for 2 min. The PCR reaction was 
completed by a final primer extension lasting 10 min at 72°C. 
Unincorporated nucleotides and primers were removed by 
filtration through Microcon 100 columns (Amicon Inc., Beverly, 
MA). The 705-bp fragments of rpoB were cloned from the amplicon 
into a pT7/T3α-18 plasmid (GIBCO BRL, Gaithersburg, MD) by use of 
BamHI and HindIII linking into DH11S competent E. coli (Life 
Technologies, Gaithersburg, MD). 
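As a rough planning aid, the cycling profile described in this section can be written down as data and its programmed hold time summed. This is a sketch only: the dictionary layout and function names are hypothetical, the hybridization temperature is passed in as a parameter (68°C is the value given for M. tuberculosis), and ramping time between temperatures is ignored.

```python
def cycling_program(hybridization_temp_c: float) -> dict:
    """Thermal profile as read from the text; durations in seconds."""
    return {
        "initial_denaturation": (95, 300),   # 5 min at 95 C
        "cycles": 35,
        "per_cycle": [
            ("denaturation", 95, 60),
            ("primer hybridization", hybridization_temp_c, 30),
            ("extension", 72, 120),
        ],
        "final_extension": (72, 600),        # 10 min at 72 C
    }


def programmed_minutes(program: dict) -> float:
    """Sum of programmed hold times, ignoring temperature ramping."""
    per_cycle = sum(seconds for _, _, seconds in program["per_cycle"])
    total = (program["initial_denaturation"][1]
             + program["cycles"] * per_cycle
             + program["final_extension"][1])
    return total / 60


print(programmed_minutes(cycling_program(68)))  # 137.5
```

The total does not depend on the hybridization temperature, so the same estimate applies to the other species.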

Nucleotide Sequencing Using High-Density 
Oligonucleotide Arrays and Dideoxynucleotide 
Methods 

Dideoxynucleotide chain termination sequencing of the rpoB 
and 16S genes (clones and PCR amplicons) was carried out on 
ABI instruments (models 373 and 377) by use of the cycle 
sequencing protocol recommended by the manufacturer 
(Perkin-Elmer Cetus, Foster City, CA). The primers used for the 
dideoxynucleotide sequencing of the rpoB gene were 
T3 (5'-ATTAACCCTCACTAAAGGGA-3'), T7 (5'-TAATACGACTCACTATAGGG-3'), 
105A (5'-GACCAGAACAACCCGC-3'), 105Z (5'-GCGGGTTGTTCTGGTC-3'), 
276A (5'-GGCTCGCTGTCGGTGTA-3'), 276Z (5'-TACACCGACAGCGAGCC-3'), 
372Z (5'-CGTGGGGGGTCAGGTA-3'), 377A (5'-CACCGCCGACGAGCAC-3'), 
537A (5'-CAGATGGTGTCGGTGG-3'), and 537Z (5'-GAGGGAGACGATGTG-3'). 
The primers used for the sequence determination of 180 bp of the 
16S genes were 312Z (5'-GTCACCCCACCAACAAG-3') and T3 (see above). 
Unincorporated dye terminators and primers were separated from the 
extension products by ethanol precipitation, and the samples 
were dried in a vacuum centrifuge. The samples were resuspended 
in a loading buffer (5:1 deionized formamide/50 mM EDTA, pH 8.0) 
and heat denatured at 90°C for 5 min prior to electrophoretic 
analysis on 6.0% and 4.25% polyacrylamide sequencing gels. The 
sequence data were edited and assembled by use of the Sequencher 
software package (Gene Codes Corp., Ann Arbor, MI). Distances 
between clusters of isolates were determined by use of the 
Jukes-Cantor distance and the neighbor-joining algorithm, as 
implemented in the University of Wisconsin GCG software package 
(Jukes and Cantor 1969). The stability of the tree generated was 
verified by standard bootstrapping methods (Efron and Tibshirani 
1993) consisting of 1000 cycles. 
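The Jukes-Cantor correction named above converts the observed fraction of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch of that formula follows. It is not the GCG implementation: the function name and the choice to skip gapped or ambiguous positions are assumptions made here, and the neighbor-joining and bootstrap steps are not shown.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two aligned sequences of equal length.

    Positions where either sequence has a character other than A, C, G,
    or T are skipped (an assumption of this sketch)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy sequences differing at 1 of 10 sites (p = 0.1).
print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT"), 4))  # 0.1073
```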

Fluorescently labeled RNA targets for chip-based sequencing 
were produced by a method similar to that described previously 
by Kozal et al. (1996). Briefly, PCR primers rpoB-F and rpoB-R 
were resynthesized to contain T3 or T7 promoter sequences. After 
PCR, fluorescein-labeled RNA amplicons were generated from the 
promoter-tagged PCR products in an in vitro transcription 
reaction, and unincorporated nucleotides were removed by 
filtration through Microcon 100 
columns. For each hybridization reaction, ~20 nM of the 
fluorescein-labeled RNA was fragmented in 30 mM MgCl2 at 95°C 
for 30 min to generate oligomer-sized RNA fragments. The 
fragmented RNA was hybridized for 30 min at 22°C in a volume 
of 500 µl of 6× SSPE, 20% deionized formamide, and 0.005% 
Triton X-100 by use of a fluidics station (Affymetrix, Santa 
Clara, CA). The high-density oligonucleotide array was then 
washed under high stringency in 1× SSPE, 20% deionized 
formamide, 0.005% Triton, followed by a short low-stringency 
wash in 6× SSPE, 0.005% Triton. The chip was then analyzed with 
a confocal scanner (Affymetrix, Santa Clara, CA) at 11.25 
µm/pixel resolution at 22°C. Data analysis, base-calling, and 
alignment of sequences were performed with GeneChip software 
(Affymetrix, Santa Clara, CA). 



Pattern Discovery in Hybridization Images 

Cluster analysis was used as an exploratory tool to examine 
whether the isolates could be grouped on the basis of the 
similarity of their hybridization patterns. For each isolate, 
the hybridization pattern was represented as an array of 5640 
probe intensities (four probes per nucleotide: one perfect-match 
probe and three mismatch probes). Each of the probe intensities 
is uniquely identified by its chip coordinates and, thus, can be 
compared across arrays. Two methods for determining the distance 
(dissimilarity) between pairs of isolates were used. The first 
method used only the 1410 perfect-match probes and was based on 
linear regression analysis (Salvatore 1982; Hamilton 1992). The 
distance between two isolates was represented as (1 - r²), where 
r² is the square of the correlation coefficient between the probe 
intensities of the two isolates over the 1410 perfect-match 
probes. For this metric, similarity can be viewed as the extent 
to which the ranks of probe intensities are preserved between the 
two isolates. This measure is invariant to chip-wise linear 
transformations of the probe intensities. An alternative way to 
visualize this similarity measure is to consider that each of the 
two arrays of length n is a set of coordinates of a point in 
n-dimensional space (n = 1410). Once each array is centered on 
its own mean, the correlation coefficient between the two arrays 
is the cosine of the angle formed between the two vectors drawn 
from the origin to these points. 
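The first distance measure can be sketched in a few lines. The code below computes 1 - r² as the squared cosine of the angle between the two mean-centered intensity vectors; the function name and the toy intensity values are hypothetical, and a real input would hold the 1410 perfect-match intensities for each isolate.

```python
def correlation_distance(x, y):
    """Return 1 - r^2, where r is the Pearson correlation of x and y.

    r is computed as the cosine of the angle between the two
    mean-centered vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length arrays with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dot = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x_sq = sum((a - mean_x) ** 2 for a in x)
    norm_y_sq = sum((b - mean_y) ** 2 for b in y)
    if norm_x_sq == 0 or norm_y_sq == 0:
        raise ValueError("correlation is undefined for a constant array")
    return 1 - dot * dot / (norm_x_sq * norm_y_sq)


# Toy intensities for two hypothetical isolates.
isolate_a = [1.0, 2.0, 3.0, 4.0]
isolate_b = [2.0, 1.0, 4.0, 3.0]
rescaled_a = [2.0 * v + 7.0 for v in isolate_a]  # chip-wise linear change

print(correlation_distance(isolate_a, isolate_b))   # r = 0.6, so 1 - r^2 = 0.64
print(correlation_distance(rescaled_a, isolate_b))  # unchanged up to rounding
```

Because the correlation removes each array's own mean and scale, an overall brightness or gain difference between two chips does not alter the distance.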

The second method uses all 5640 probe intensities without 
regard for which probes are complementary to the wild-type 
M. tuberculosis rpoB gene sequence. First, probe intensities for 
each isolate were standardized to zero mean and unit variance by 
a normal score transformation (Blom 1958; Tukey 1962). The 
variance, over isolates, of each probe intensity was then 
computed. Probe intensities that have little variability over 
isolates are, in most cases, not very informative about 
differences among the samples. Therefore, only the probe 
intensities with variances in the top 10% were retained (564 
probe intensities). A principal components analysis on the 
564 × 564 covariance matrix of the retained intensities 
identified 15 principal components that accounted for 93% of the 
observed variance among the 121 isolates. For each isolate, the 
corresponding 15 principal component scores were computed. The 15 
principal component scores are mutually orthogonal, so each 
isolate is represented as a point in a 15-dimensional Euclidean 
space. The distance between two isolates was defined as the 
Euclidean distance between the two points in this space. 
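The steps of the second method can be sketched as follows, assuming NumPy is available. This is an illustration rather than the analysis code used in the study: the function names are hypothetical, Blom's formula (r - 3/8)/(n + 1/4) is used for the normal scores, ties are broken by position, and the random matrix stands in for real probe intensities.

```python
from statistics import NormalDist

import numpy as np


def normal_scores(values):
    """Blom normal scores: the value with rank r of n is replaced by the
    standard normal quantile at (r - 3/8) / (n + 1/4)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    ranks = np.empty(n)
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    quantile = NormalDist().inv_cdf
    return np.array([quantile((r - 0.375) / (n + 0.25)) for r in ranks])


def pca_distances(intensities, keep_fraction=0.10, n_components=15):
    """intensities: isolates x probes. Returns the isolates x isolates
    matrix of Euclidean distances between principal component scores."""
    data = np.asarray(intensities, dtype=float)
    scored = np.vstack([normal_scores(row) for row in data])

    # Keep the probes whose scores vary most across isolates.
    variances = scored.var(axis=0, ddof=1)
    n_keep = max(1, int(round(keep_fraction * scored.shape[1])))
    kept = scored[:, np.argsort(variances)[-n_keep:]]

    # PCA via eigendecomposition of the probe-by-probe covariance matrix.
    centered = kept - kept.mean(axis=0)
    covariance = np.atleast_2d(np.cov(centered, rowvar=False))
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    leading = np.argsort(eigenvalues)[::-1][:n_components]
    scores = centered @ eigenvectors[:, leading]

    differences = scores[:, None, :] - scores[None, :, :]
    return np.sqrt((differences ** 2).sum(axis=-1))


# Stand-in data: 12 hypothetical isolates x 200 hypothetical probes.
rng = np.random.default_rng(0)
distances = pca_distances(rng.random((12, 200)), n_components=5)
print(distances.shape)
```

With the dimensions given in the text (121 isolates, 5640 probes), `keep_fraction=0.10` retains 564 probes and `n_components=15` matches the number of components reported.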

Each of the two methods produced a 121 × 121 inter-isolate 
distance matrix. These matrices were input into a hierarchical 
clustering procedure (20). The observations were clustered by use 
of four linkage methods: single, average, complete, and Ward's 
minimum variance (Ward 1963). The clustering structures were 
similar for each of the four methods; the results for 
single-linkage clustering are shown. The results of cluster 
analysis are visualized in two ways: as dendrograms (Johnson 
1967) or as color grids derived by rearranging the rows and 
columns of the distance matrix to correspond to the obtained 
clustering structure and then displaying the pairwise distances 
with a color palette to represent the range of distances 
(noncorrelative = red to correlative = blue). 
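To make the linkage choices concrete, the sketch below runs a naive agglomerative procedure directly on a square distance matrix. It is an illustration under stated assumptions, not the clustering software used in the study: the function name is hypothetical, only single, complete, and average linkage are covered (Ward's method is omitted), and the tree is cut at a requested number of clusters instead of being drawn as a dendrogram.

```python
def agglomerate(dist, n_clusters, linkage="single"):
    """Merge the two closest clusters until `n_clusters` remain.

    dist: square, symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of indices."""
    combine = {
        "single": min,                              # closest pair of members
        "complete": max,                            # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),    # mean over all pairs
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist[i][j]
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Toy example: six observations on a line, forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(p - q) for q in points] for p in points]
for method in ("single", "complete", "average"):
    print(method, agglomerate(toy_dist, 2, method))
```

On well-separated data such as this toy example the three linkages agree, which mirrors the observation above that the clustering structures were similar across methods.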

To assess the cohesion or stability of the resulting clusters, 
we performed a bootstrap analysis. For each cluster analysis, we 
ran 1000 bootstrap replications; for each replication, we 
resampled, with replacement, 121 observations from the original 
data set. Cluster analysis was performed on each replication. For 
each replication and for every cluster obtained in the original 
analysis, we computed the percentage of the observations that 
belonged to that cluster. The confidence values we report are 
these percentages averaged over the 1000 bootstraps. 
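The bootstrap procedure can be sketched as below. The text does not spell out how an original cluster is matched to the clusters found in a replicate, so this sketch makes an assumption: for each original cluster, it takes the replicate cluster that captures the most of that cluster's resampled members and records the captured percentage. The function name and the `cluster_fn` callable (any routine that maps a distance matrix to a list of index lists) are hypothetical.

```python
import random


def bootstrap_cohesion(dist, original_clusters, cluster_fn,
                       n_boot=1000, seed=0):
    """Average percentage of each original cluster's resampled members
    that stay together in one cluster, over `n_boot` replicates."""
    rng = random.Random(seed)
    n = len(dist)
    totals = [0.0] * len(original_clusters)
    counts = [0] * len(original_clusters)
    for _ in range(n_boot):
        # Resample n observations with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        sub = [[dist[i][j] for j in sample] for i in sample]
        boot_sets = [set(c) for c in cluster_fn(sub)]
        for k, members in enumerate(original_clusters):
            member_set = set(members)
            drawn = [pos for pos, obs in enumerate(sample)
                     if obs in member_set]
            if not drawn:
                continue  # cluster absent from this replicate
            best = max(sum(pos in bc for pos in drawn) for bc in boot_sets)
            totals[k] += best / len(drawn)
            counts[k] += 1
    return [100.0 * t / c if c else float("nan")
            for t, c in zip(totals, counts)]
```

A call would pass the inter-isolate distance matrix, the clusters from the original analysis, and the same clustering routine used for that analysis, wrapped so that it returns index lists.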